# install new packages IN THE CONSOLE!
install.packages("pacman")Data import
Reading in data files
Readings
required reading: Ch. 8 (data import) in Wickham et al. (2023)
supplementary reading: Ch. 4 (data import) in Nordmann & DeBruine (2022)
Review
So far, we’ve learned how to…
- use Quarto scripts for reproducible data analysis
- load in built-in datasets
- wrangle data with
dplyrverbs - visualise distributions and relationships of different variable types
Learning objectives
Today we will learn how to:
- create and store local data files (
.csv) - import local data files with the
readrpackage - handle missing values
- change variables to factors
1 Packages
- we will start using the
pacmanpackage instead ofinstall.packages()andlibrary- the function
p_load()takes package names as arguments - then checks if you have the package installed
- if you do –> loads the package (just like
library()) - if you don’t –> the package is installed and then loaded (like using
install.packages()+library())
- if you do –> loads the package (just like
- the function
- this saves us from having to install new packages individually
# load packages
pacman::p_load(tidyverse, # wrangling
languageR, # linguistic datasets
janitor, # wrangling
here, # relative file paths
patchwork # plot layout
)- we now have
tidyverseloaded, and the new packagesjanitorandhereinstalled and loaded- to find out more about these packages, try typing
?janitorand?herein the Console.
- to find out more about these packages, try typing
2 CSV: Comma separated value
there are many different file types that data can take, e.g., .xlsx, .txt, .csv, .tsv.
csvis the most typical data file type, and stands for: Comma Separated Values.this is what a simple CSV file looks like when viewed as raw text
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6
- the first row (the “header row”) contains the columns names
- the subsequent rows contain the data
- How many variables are there? how many observations?
2.1 Tidy data
- you want your data to be tidy
- tidy data is rectangular, and:
- each column represents a variable
- each row an observation
- each cell a data point (Figure 1).
- make sure you’re working in the class RProject!
- if you aren’t, follow the exercise on the course website here
3 Spreadsheet to csv
- let’s collect some toy data in a spreadsheet which we will then save as a CSV file and load into R
- Click here to go to an editable spreadsheet
- Enter the relevant information about yourself, or make up some data: the name of a pet you have/had, height, birth month and day, plus your first language. If you have no pet, leave the cell blank.
3.0.1 Save csv
- save the spreadsheet as
groesse_geburtstag.csvto your computer, directly in a folder calleddatenin our project directory
Saving a CSV
Example 1
- Create a new folder called
datenin your project folder (if you haven’t already). - Download the Google Sheet and save it in your
datenfolder asgroesse_geburtstag.csv - Go to your
datenfolder and check that the CSV file is there.
4 The readr package
we now have to read in the data
we have to use a function that reads CSV data, and to specify where the data is in our RProject folder
the
readrpackage (part oftidyverse) can load in most data types, and has multiple functions for different data types
read_csv(here::here("daten", "groesse_geburtstag.csv"))| Haustier | Größe | Monat der Geburt | Tag | L1 |
|---|---|---|---|---|
| Lola | 171 | 5 | 7 | Englisch |
| NA | 168 | 11 | 26 | Deutsch |
| N/A | 182 | 4 | 15 | Deutsch |
readr
Example 2
- Import the
groesse_geburtstag.csvdataset and save it as an object calleddf_groessedf_is short for DataFrame; it’s a good idea to use a prefix before object names so we know what each object contains
- When importing data with
read_csv, some information is printed in the Console. What is printed? - Explore the dataset using the functions like
summary()orhead() - Do you see anything odd?
5 The here package
- how did R know exactly where to find the
datenfolder? - our working directory is set to the location of our RProject within our computer
- whenever we want to access data in our RProject, we should nest our filepath in
here()
- whenever we want to access data in our RProject, we should nest our filepath in
- to see where
here()is starting from, runhere()
here()[1] "/Users/danielapalleschi/Documents/IdSL/Teaching/WiSe2324/B.A./r4ling"
- the output will look different for all of our machines, because it’s relative to where we placed our project folder
6 Working with data
6.1 Missing values
- you might’ve noticed some
NAorN/AvaluesN/Awas written as text in one of our observations, and so R reads it as suchNAs in R refer to missing data (“Not Available”)- true missing values are completely empty, so having
N/Awritten in ourdf_groessedata is not actually read as a missing value - to fix this, we can use the argument
na =for theread_csv()function, which tellsread_csv()which values it should equate with missing values
# force "N/A" to missing values
df_groesse <- read_csv(here::here("daten", "groesse_geburtstag.csv"),
na = "N/A")# print the head of the data set
head(df_groesse)# A tibble: 3 × 5
Haustier Größe `Monat der Geburt` Tag L1
<chr> <dbl> <dbl> <dbl> <chr>
1 "Lola" 171 5 7 Englisch
2 "" 168 11 26 Deutsch
3 <NA> 182 4 15 Deutsch
- the value that was formerly
""is read as anNA - but what about the empty cell? We have now overwritten
read_csv()reading empty cells asNA- now we want to tell
read_csv()to read more than one type of input asNA, i.e., we want to tell it to read""and"N/A"asNA - for this, we use our always useful concatenate function:
c().
- now we want to tell
# force "N/A" and empty cells to missing values
df_groesse <- read_csv(here::here("daten", "groesse_geburtstag.csv"),
na = c("N/A",""))# print the head of the data set
head(df_groesse)# A tibble: 3 × 5
Haustier Größe `Monat der Geburt` Tag L1
<chr> <dbl> <dbl> <dbl> <chr>
1 Lola 171 5 7 Englisch
2 <NA> 168 11 26 Deutsch
3 <NA> 182 4 15 Deutsch
6.2 Column names
- one column name in our data is surrounded by backticks (e.g.,
`Monat der Geburt`)- this is because it contains an empty space, which is not syntactically valid
- a quick fix is to the function
clean_names()from thejanitorpackage, which we’ve already loaded in
clean_names(df_groesse)# A tibble: 3 × 5
haustier grosse monat_der_geburt tag l1
<chr> <dbl> <dbl> <dbl> <chr>
1 Lola 171 5 7 Englisch
2 <NA> 168 11 26 Deutsch
3 <NA> 182 4 15 Deutsch
- that looks better! But if you now run
head(df_groesse), do you see the cleaned column names? - you shouldn’t because when we pass an object through a function, the object is not ‘updated’
- so we have to again assign the object using the assignment operator
<-
- so we have to again assign the object using the assignment operator
Code
df_groesse <- janitor::clean_names(df_groesse)6.3 Pipes
- pipes are placed at the end of function call when the result of this function should be passed through a subsequent function. Pipes can be read as “and then…”.
read_csv(here::here("daten", "groesse_geburtstag.csv")) |>
head()# A tibble: 3 × 5
Haustier Größe `Monat der Geburt` Tag L1
<chr> <dbl> <dbl> <dbl> <chr>
1 Lola 171 5 7 Englisch
2 <NA> 168 11 26 Deutsch
3 N/A 182 4 15 Deutsch
There are currently 2 pipes that can be used in R.
- the
magrittrpackage pipe:%>% - the new native R pipe:
|>
- there aren’t any major differences that are important for our current uses
- you can use the keyboard shortcut
Cmd/Ctrl + Shift/Strg + Mto produce a pipe
Example 3
- Load the
groesse_geburtstag.csvdataset again with fixedNAs and then- Use a pipe to call
clean_names()on the dataset, and then - call the
head()function - check the number of observations and variables, is there an issue?
- Use a pipe to call
- Load the
groesse_geburtstag.csvdataset again with fixedNAs, saving it as the objectdf_groesse, and then- use a pipe to call
clean_names()on the data set
- use a pipe to call
- Why shouldn’t you use a pipe and the
head()function when you’re saving the dataset as an object?
6.4 Other file types
readrhas other functions which are also easy to use, you just have to know when to use which onesread_csv2()reads semicolon-separated csv files (;)- this file type is common in countries that use
,as the decimal marker (like Germany)
- this file type is common in countries that use
read_tsv()reads tab-delimited filesread_delim()function reads in files with any delimiter- it will try to guess the delimiter unless you specify it with the argument
delim =(e.g.,read_delim(groesse_geburtstag.csv, delim = ","))
- it will try to guess the delimiter unless you specify it with the argument
7 Working with variables
- the major columns types to know are
numericalandfactor(categorical) - factors contain categories or groups of data, but can sometimes look like
numericaldata- for example, our column
monthcontains numbers, but it could also contain the name of each month - a good way to know which is which: it makes sense to calculate the mean of a
numericalvariable, but not of afactor - for example, it makes sense to calculate our average height, but not our average birth month
- for example, our column
as_factor()
- we can use the
as_factor()function to change a variable type to factor - We can either use base R syntax to do this by using an
$to index a column in a dataframe:
# mit base R
df_groesse$monat_der_geburt <- as_factor(df_groesse$monat_der_geburt)- or we can use
tidyversesyntax and themutate()function
# mit tidyverse
df_groesse <-
df_groesse |>
mutate(monat_der_geburt = as_factor(monat_der_geburt))Learning objectives 🏁
Today we learned how to…
- import local data files with the
readrpackage ✅ - handle missing values ✅
- change variables to factors ✅
Let’s now put this new knowledge to use.
Homework
Let’s now practice using the readr package and wrangling our data.
readr functions
- What function would you use to read a file where fields were separated with “
|”? - What arguments do
read_csv()andread_tsv()have in common? - Which function(s) could you use to load in a dataset with a semicolon (
;) as delimiter?
Data wrangling
Re-load the groesse_geburtstag.csv file. Use pipes to also use the clean_names function, and to make the following changes in the object df_groesse:
- Convert the variable
l1to a factor. - Rename
grossetogroessemonat_der_geburttogeburtsmonat
Plots
- produce a scatterplot using our
df_groessedataset, visualising the relationship between our birth day and our birth days (this doesn’t make sense to compare, but it’s just an exercise). Set the colour and shape to correspond toL1. Add a plot title. - Find your birthday on the plot.
- Produce a barplot showing the number of observations per
l1
Session Info
Hergestellt mit R version 4.3.0 (2023-04-21) (Already Tomorrow) und RStudioversion 2023.9.0.463 (Desert Sunflower).
sessionInfo()R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Europe/Berlin
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] magick_2.7.4 patchwork_1.1.3 here_1.0.1 janitor_2.2.0
[5] languageR_1.5.0 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
[9] dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0
[13] tibble_3.2.1 ggplot2_3.4.3 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] utf8_1.2.3 generics_0.1.3 stringi_1.7.12 hms_1.1.3
[5] digest_0.6.33 magrittr_2.0.3 evaluate_0.21 grid_4.3.0
[9] timechange_0.2.0 fastmap_1.1.1 rprojroot_2.0.3 jsonlite_1.8.7
[13] fansi_1.0.4 scales_1.2.1 cli_3.6.1 crayon_1.5.2
[17] rlang_1.1.1 bit64_4.0.5 munsell_0.5.0 withr_2.5.0
[21] yaml_2.3.7 parallel_4.3.0 tools_4.3.0 tzdb_0.4.0
[25] colorspace_2.1-0 pacman_0.5.1 vctrs_0.6.3 R6_2.5.1
[29] lifecycle_1.0.3 snakecase_0.11.0 htmlwidgets_1.6.2 bit_4.0.5
[33] vroom_1.6.3 pkgconfig_2.0.3 pillar_1.9.0 gtable_0.3.4
[37] Rcpp_1.0.11 glue_1.6.2 xfun_0.39 tidyselect_1.2.0
[41] rstudioapi_0.14 knitr_1.44 htmltools_0.5.5 rmarkdown_2.22
[45] compiler_4.3.0